(57) Abstract

A domain name is acquired from a domain name registry. It is determined whether the domain name represents any functioning Web 
site. If the domain name is associated with at least one functioning Web site, it is recorded that the domain name represents at least one 
functioning Web site. A set of criteria is acquired. A set of entities is identified that meet the criteria. It is determined how many entities 
in the set of entities are registered as having control over at least one Web site. 
ANALYZING INTERNET-BASED INFORMATION

Cross-Reference to Related Applications 
This application claims the benefit of United States Provisional Application 
Serial No. 60/097029 entitled "Collecting, Combining, Analyzing, and Using 
Internet and Business Information" filed on August 17, 1998, which is incorporated
herein. 

Background of the Invention 
This application relates to analyzing Internet-based information. 
World-Wide Web ("Web") protocols call for text-based addressing
information, which is highly suitable for human users, to be converted to
number-based addressing information, which is highly suitable for computers. Much of
the information available on the Web is organized into Web pages that can be
retrieved and displayed by Web browser software ("browser") under the direction
of a user. Each of the Web pages is identifiable by a respective Uniform Resource
Locator text string ("URL"), such as "http://www.isp321.com/frontpage.html",
that the browser can use to select the page. Each URL includes a domain name,
such as "isp321.com", that identifies the Web site where the corresponding Web
page is stored for retrieval by browser software. Each domain name is registered
by an entity that controls the corresponding Web site and Web pages. A domain
name registry organization maintains the domain name registration information,
which may include name, address, and other information that allows the
organization to bill the entity for payment for the maintenance. (It is to be 
understood that the term "registry", as used herein, also refers to a domain name 
registrar or any other entity that may provide assistance in registering a domain 
name.) When the domain name is registered, the entity identifies a domain name
server computer system that stores a numeric address (known as an IP address)
that corresponds to the domain name, and the domain name registry stores the
identity of the domain name server computer system together with the domain
name in a file known as a zone file. The domain name registry also reports the
domain name together with the identity of the domain name server computer
system to a root zone server computer system, which is a high level computer
system that is responsible for helping other computer systems properly derive IP
addresses from domain names (e.g., as described below). The root zone server
computer system receives such reports from effectively all domain name registries
as domain names are registered, and therefore the root zone server computer
system has a comprehensive list of domain names that are registered on the Web.

When the Web browser is directed to retrieve information from a Web site 
identified by a URL, the browser must determine the IP address of the Web site to 
which the URL refers. If the browser submits the domain name part of the URL 
to the root zone server computer system, the root zone server computer system
determines, based on information previously supplied by the domain name
registry, the identity of the domain name server computer system for the domain
name and reports the identity to the browser. The browser then refers to the
domain name server computer system to find out the IP address, and uses the IP
address to attempt to contact the Web site. If the Web site is functioning, the
browser receives a response (an "HTML header") from the Web site. If the Web
site is not functioning, the browser receives no response from the Web site, and
times out indicating an error.

An Internet service provider ("ISP") is an example of an entity that may 
have a registered domain name for a Web site. Typically, an ISP has customers
such as individuals or businesses for whom the ISP stores Web pages on the Web
site for retrieval by Web browser software. For example, the ISP may have a
customer Maple Street Plumbing for which the ISP stores a home Web page
having a URL that includes a prefix "http://www.isp321.com/~maplestplumb".
A home Web page is typically the only or the primary entry point into a Web site
or a set of Web pages that are under the control of an entity.

Another example of an entity that may have a registered domain name is a
Web portal site such as "Yahoo.com" that maintains, in pages organized by
categories, links to Web sites and home pages that are under the control of other
entities. Typically, a Web portal site allows another entity to create a link from
the Web portal site to the other entity's Web site or home page by submitting
information to the Web portal site.

A Web search engine site ("search engine") maintains and updates a search 
engine database, i.e., a Web page record database, that includes a Web page 
record for every Web page that has been turned up by Web sweeping software
that sweeps the World Wide Web for any and all Web pages. A typical Web page
record includes a URL for the respective Web page, an excerpt or other subset of
the information provided by the Web page, and a date indicating the most recent
update of the Web page record. When a user directs a Web search engine to
execute a search, the Web page record database is searched and then search
engine results are displayed to the user in the form of a list of Web page records.

The Web sweeping software discovers information on the Web, including 
domain names previously unknown to the search engine, by following links 
among Web pages. 

Some information about an entity may not be available on a Web site that
is under the control of the entity. For example, public financial information about
a company may be stored in a database that is not linked to the company's Web 
site or is not directly accessible by Web browser software, such as a database 
under the control of a financial services firm.

In general, statistical information regarding Web activity for companies or
other entities in a particular sector of human activity, such as an industrial sector,
is expressed in broad terms such as the total number of uniquely qualified
domain name Web sites ("unique Web sites") that are sponsored by all of the
companies in a particular industrial sector. Such statistics may prove misleading
for at least some purposes. For example, if ten companies belong to a sector that 
is known to have ten unique Web sites, the resulting average, i.e., one unique 
Web site per company, can make the sector appear to be well represented by 
unique Web sites, even if in actuality all ten unique Web sites belong to only one 
of the companies and none of the other nine companies has a unique Web site. 



Summary of the Invention
Methods and systems are provided for analyzing Internet-based 
information. An Internet analysis system is provided that gathers domain names 
and determines whether the domain names are associated with functioning Web 
sites. A variation of the Internet analysis system that includes an entity 
information database and a mapping database is able to generate reports 
regarding Web activity in sectors of human activity such as industrial sectors. 

Different aspects of the invention allow one or more of the following. A 
comprehensive list of tested domain names can be produced. Domain names for 
Web sites that are difficult or impossible for a search engine to discover can be
made available to the search engine to allow the search engine to produce search
results that account for the contents of the previously undiscovered Web sites.
Before being provided to the search engine, the domain names may be prioritized
or sorted according to one or more attributes (such as industry sector or company
size) of the respective entities that are registered as having control over the
domain names. Highly useful statistics can be produced concerning the number
of entities in an industrial sector that are registered as having control over Web
sites. Such statistics can be used for highly effective marketing or sales
approaches in which Web oriented products are targeted at potential customers in
industrial sectors that are shown by the statistics to have substantial Web activity.

Other features and advantages will become apparent from the following 
description, including the drawings, and from the claims. 



Brief Description of the Drawings
Figs. 1, 6, and 7 are block diagrams of computer-based systems. 
Figs. 2, 3, 4, 5, and 10 are flow diagrams of computer-based procedures. 
Figs. 8 and 11 are illustrations of output produced by software. 
Figs. 9A-9B are illustrations of computer data. 



Detailed Description
Fig. 1 illustrates an Internet analysis system 110 in which a domain names 
analysis application 112 executes a procedure 1000 (Fig. 2) to collect domain 
names 114 from a domain name source 116 (step 1010), test the domain names to 
determine which of the domain names are domain names that correspond to 
functioning Web sites ("live domain names" 115) (step 1020), and deliver live 
domain names to a search engine 116 for use in searching the Web (step 1030). 

The domain names analysis application collects domain names as follows. 
The domain name source may include a domain name registry or a root zone 
server or both. To collect domain names from the domain name registry, the 
domain names analysis application executes a procedure 2000 (Fig. 3) to submit a 
request to the domain name registry for a zone file (step 2010), download the 
requested zone file (step 2020), and extract domain names from the requested 
zone file (step 2030). In a specific embodiment, the zone file is downloaded by
use of a binary transfer procedure known as an FTP transfer. Fig. 11 illustrates an 
example of a portion of a zone file and extracted domain names. The example is 
divided into sections. In a typical section, as shown in the example, the first line 
includes the domain name (in the first column) and its corresponding domain 
name server (in the last column), and the next line lists the domain name server 
(in the first column) and its actual IP address (in the last column). If the domain 
name has more than one domain name server, the section may include additional
lines, each including name and IP address information for another domain name 
server. After the zone file is downloaded, the domain names are extracted and 
duplicate domain names are removed. 
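
The extraction and de-duplication step can be sketched as follows. This is a minimal illustration, not the procedure of Fig. 3 itself: it assumes whitespace-separated columns and an "NS" record-type token between the domain name and its name server, neither of which is stated above, and the function name `extract_domain_names` and the sample records are hypothetical.

```python
def extract_domain_names(zone_text):
    """Return unique domain names from zone-file-like text, in first-seen order."""
    seen = set()
    names = []
    for line in zone_text.splitlines():
        fields = line.split()
        # Assumed layout: <domain name> ... NS ... <name server>.
        # Lines that pair a name server with its IP address are skipped.
        if len(fields) < 3 or "NS" not in (f.upper() for f in fields[1:-1]):
            continue
        name = fields[0].rstrip(".").lower()
        if name and name not in seen:  # drop duplicate domain names
            seen.add(name)
            names.append(name)
    return names


# Hypothetical sample in the two-line-per-section shape described above.
SAMPLE = """\
ELMSTDOGS.COM.      NS  NS1.ISP321.COM.
NS1.ISP321.COM.     A   192.0.2.1
ELMSTCATS.COM.      NS  NS1.ISP321.COM.
ELMSTDOGS.COM.      NS  NS2.ISP321.COM.
NS2.ISP321.COM.     A   192.0.2.2
"""

print(extract_domain_names(SAMPLE))
```

A domain name that lists several name servers appears once in the result, because only the first column of each name-server line is kept and repeats are discarded.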

To collect domain names from the root zone server, the domain names
analysis application executes a procedure 3000 (Fig. 4) to request domain name
information record by record from the root zone server (step 3010) and extract the
domain names from the domain name information (step 3020). In a specific
embodiment, domain names are collected from a root zone server as follows.
First, the root zone server (e.g., F.Root-Servers.net) is selected and data from the
root zone server is directed into a file; the following is a sequence in which the F
root server is selected and the F root server is directed to unload all data that
ends in "com" into a file called "com.txt":

> nslookup
> server f.root-servers.net
> ls com > com.txt

Next, a Perl program is executed to extract the domain names from the file 
in accordance with the principles described above in connection with the zone file 
example. 



Finally, the process, with appropriate changes, is repeated as necessary to
collect other domain names. Since different root zone servers are responsible for 
different domain name extensions (e.g., "com", "net", "edu", "ca", "uk"), collecting a 
comprehensive list of domain names requires gathering domain name information 
from multiple root zone servers. In a specific embodiment, other root zone 
servers are identified by use of a "whois" command. For example, to identify a 
root zone server that is responsible for "ca" which is the domain name extension 
for Canada, the following command line is used. 
> whois ca-dom 

The response generated in this case is "relay.cdnnet.ca". Domain names are 
gathered from the "relay.cdnnet.ca" server by using a variation of the process 
described above, which variation uses "relay.cdnnet.ca" in place of 
"f.root-servers.net".

To test a domain name, the domain names analysis application executes a 
procedure 4000 (Fig. 5) to attempt to acquire the IP address associated with the 
domain name (step 4010) and, if the IP address is acquired, to attempt, by a 
request known as an HTTP protocol query, to retrieve an HTML header from a 
server having the IP address (step 4020). In a specific implementation, a prefix
"www." is added to the domain name to form a URL, and the URL is handled
much as a typical URL is handled by Web browser software. For example, a
protocol known as telnet is used to attempt to connect to a Web server and
retrieve an HTML header. The following command lines illustrate an example for 
the case of the "uspto.gov" domain name: 
> telnet www.uspto.gov 80 
> dump

If both attempts 4010, 4020 succeed, the domain name is determined to be a 
live domain name (step 4030). If either attempt fails, the domain name is 
determined not to be a live domain name (step 4040). In the case of the attempt 
to retrieve the HTML header, failure takes the form of a timing out, because the 
domain names analysis application fails to receive any response. If the Web site
returns an error page, it does not qualify as a failure, because the error page 
includes the HTML header. In such a case, the Web site is functioning, but its 
contents may be corrupted or may be blocked by security arrangements. 
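
The two-attempt test of procedure 4000 might be sketched as below. This is an illustrative sketch under stated assumptions rather than the implementation described above: it uses Python sockets instead of telnet, sends an HTTP HEAD request to port 80, treats any bytes received as a response, and applies the "www." prefix to both the lookup and the request. The function names and the ten-second timeout are hypothetical.

```python
import socket


def resolve_ip(host):
    """Step 4010: try to obtain an IP address; return None on failure."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        return None


def fetch_header(ip_address, host, timeout=10.0):
    """Step 4020: request a header; return the raw reply, or None on failure."""
    request = f"HEAD / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    try:
        with socket.create_connection((ip_address, 80), timeout=timeout) as conn:
            conn.sendall(request)
            data = conn.recv(4096)
    except OSError:  # covers timeouts and refused connections
        return None
    return data or None


def is_live(domain_name, resolve=resolve_ip, fetch=fetch_header):
    """Steps 4030/4040: live only if both attempts succeed."""
    host = "www." + domain_name
    ip_address = resolve(host)
    if ip_address is None:
        return False
    # Any reply counts, including an error page, as long as it arrives.
    return fetch(ip_address, host) is not None
```

The `resolve` and `fetch` parameters exist so the decision logic can be exercised without network access, for example `is_live("uspto.gov", resolve=lambda h: None)` returns `False`.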

The live domain names are delivered to the search engine as a list that is 
added to the search engine's list of domain names to be searched for content to be
recorded in the search engine's index. In a variation, all or some of the domain
names are delivered after being sorted in accordance with a prioritization scheme
that takes into account information, gathered from mapping and entity
information databases (described below), pertaining to respective entities that are
registered as having control over the domain names. For example, the only
domain names delivered may be domain names registered as being under the
control of telecommunications companies or companies in a particular city. 
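
One possible prioritization scheme is sketched below: keep only the live domain names whose registered entity belongs to a chosen sector, then order them by entity size. The dictionary layouts, the field names `sector` and `employees`, the entity IDs, and the "bigtel.example" domain are all hypothetical stand-ins for the mapping and entity information databases.

```python
def select_and_sort(live_domains, domain_to_entity, entity_info, sector):
    """Return live domain names controlled by entities in `sector`, largest first."""
    selected = []
    for domain in live_domains:
        entity_id = domain_to_entity.get(domain)
        info = entity_info.get(entity_id)
        if info is not None and info["sector"] == sector:
            selected.append((info["employees"], domain))
    # Employee count descending, then name, so the order is repeatable.
    selected.sort(key=lambda pair: (-pair[0], pair[1]))
    return [domain for _, domain in selected]


# Hypothetical stand-in for the mapping database (domain name -> unique ID).
DOMAIN_TO_ENTITY = {
    "elmstdogs.com": "E1",
    "isp321.com": "E2",
    "bigtel.example": "E3",
}
# Hypothetical stand-in for the entity information database.
ENTITY_INFO = {
    "E1": {"sector": "pets", "employees": 12},
    "E2": {"sector": "telecommunications", "employees": 300},
    "E3": {"sector": "telecommunications", "employees": 4000},
}
LIVE = ["isp321.com", "elmstdogs.com", "bigtel.example", "unmapped.example"]

print(select_and_sort(LIVE, DOMAIN_TO_ENTITY, ENTITY_INFO, "telecommunications"))
```

Domain names with no mapped entity are simply left out here; a different scheme could instead place them at the end of the list.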

Fig. 6 illustrates a variation 200 of the Internet analysis system that allows a 
user to produce research reports regarding Web activity in sectors of human 
activity. System 200 includes a mapping database 12 (Fig. 7) that maps URLs or 
domain names 14 to entities 16 such as people, businesses, or government 
agencies, as described in more detail below. For example, the mapping database 
may indicate that any URL that begins with "http://www.uspto.gov" is for a Web 
page controlled by the U.S. Patent and Trademark Office, or that domain names 
"elmstdogs.com" and "elmstcats.com" are under the control of a company named 
Elm Street Pets, Inc. System 200 also includes a Web activity analysis application 
202 (described below) and an entity information database 28 (described below) 
that includes information such as geographic information about entities to which 
URLs or domain names are mapped in the mapping database. 

The mapping database may use a unique identification number ("unique 
ID"), such as a 9-digit American Business Information ("ABI") number or a 
DUNS number, to identify an entity so that other information about the entity
can be retrieved from the entity information database or elsewhere by searching
under the unique ID. (ABI numbers are sponsored by infoUSA.) For example,
unique IDs from the mapping database may be used to search the entity
information database to produce a subset of the mapping database that has
records only for entities having a particular characteristic, such as a particular
geographic location or between 1000 and 5000 employees. 

With respect to the mapping database, where an entity constitutes a portion 
of another entity, each of the entities may be assigned different unique IDs, and 
the different unique IDs may be linked in the mapping database to note the 
relationship among the entities. For example, a company that has offices in 
different locations may be assigned a unique ID for the company itself and a 
respective different unique ID for each location. In another example, when two 
previously unrelated companies merge or one is acquired by the other, each may 
retain its unique ID and a new, different unique ID may be assigned to the 
combination of the two companies, or both companies may be assigned the same 
unique ID. 
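
As a rough sketch of how linked unique IDs could be used, the function below follows "part of" links from a location-level ID up to the company-level ID, so that a Web site registered by one office can be credited to the company as a whole. The `parent_of` table and the ID strings are hypothetical; the text above does not specify how links are stored.

```python
def top_level_id(unique_id, parent_of):
    """Follow parent links until an entity with no recorded parent is reached."""
    seen = set()
    # The `seen` set guards against accidental cycles in the link table.
    while unique_id in parent_of and unique_id not in seen:
        seen.add(unique_id)
        unique_id = parent_of[unique_id]
    return unique_id


# Hypothetical links: two office locations that are portions of one company.
PARENT_OF = {"LOC-1": "CO-1", "LOC-2": "CO-1"}

print(top_level_id("LOC-2", PARENT_OF))
```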

Information in the mapping database may be derived from information 
submitted by or on behalf of the entity when a domain name is registered. For 
example, when the company Elm Street Pets, Inc. registers the domain names 
"elmstdogs.com" and "elmstcats.com" with a domain name registry, the company 
associates the domain names with at least enough information, such as name, 
address, and telephone number information, to allow the domain name registry to 
bill the company for maintenance of the registration. 
The entity may submit information to the mapping database in other ways
such as in an on-line questionnaire that feeds the mapping database. 

Information in the mapping database may be derived from information 
provided by an intermediary such as an ISP or an Internet portal. For example, an 
ISP having a domain name "isp321.com" may have a customer Maple Street
Plumbing for which the ISP hosts and administers a home page having a home
page address "www.isp321.com/~maplestplumb". In such a case, the ISP may
have name, address, and telephone number information for the purpose of billing
Maple Street Plumbing for such hosting and administration, and may allow such
information along with the home page address to be used to link the home page
address to Maple Street Plumbing in the mapping database. 

In another example, an Internet portal may allow an entity such as Maple 
Street Plumbing to create an entry or listing named "Maple Street Plumbing" in a 
"plumbing" section of an on-line directory maintained by the portal, to allow a user
to view home page "www.isp321.com/~maplestplumb" by selecting the entry. In
such a case, the Internet portal may allow information in the entry, and perhaps 
any address and telephone number information submitted by the entity during 
creation of the entry, to be used to link the home page to Maple Street Plumbing 
in the mapping database. 
The mapping database, the entity information database, and the Web
analysis application allow a report such as report 204 (Fig. 8) to be generated that
shows, in absolute numbers and as a percentage, how many entities in an
industrial sector are registered as controlling one or more Web sites. The entities
included in the report may also or instead be limited by geographical area or by
any other attribute stored for entities in the entity information database (Figs. 9A-
9B illustrate a list of such attributes). To generate the report, the Web analysis
application executes a procedure 5000 (Fig. 10) to search the entity information
database for entities that match an industrial sector code such as an SIC code or a
North American Industrial Classification System ("NAICS") code ("sector entities")
(step 5010), determine from the mapping database which of the sector entities are
registered as controlling one or more Web sites ("Web entities") (step 5020), and
account for each of the sector entities and Web entities in the report (step 5030),
such as by presenting quantities for sector entities and Web entities and indicating
the number of Web entities as a percentage of the sector entities.
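
Steps 5010 through 5030 can be sketched as a pair of set operations. The sketch assumes in-memory stand-ins for the two databases; the field name `sector_code`, the placeholder code "SECTOR-A", and the entity IDs are hypothetical. The sample data reproduces the ten-company situation described in the Background, where one company controls all ten unique Web sites.

```python
def penetration_report(entity_info, mapped_entity_ids, sector_code):
    """Count sector entities and Web entities and give the percentage."""
    # Step 5010: entities whose sector code matches.
    sector_entities = {
        eid for eid, info in entity_info.items()
        if info["sector_code"] == sector_code
    }
    # Step 5020: those that also appear in the mapping database.
    web_entities = sector_entities & set(mapped_entity_ids)
    # Step 5030: quantities and percentage for the report.
    percent = (100.0 * len(web_entities) / len(sector_entities)
               if sector_entities else 0.0)
    return {
        "sector_entities": len(sector_entities),
        "web_entities": len(web_entities),
        "percent": percent,
    }


# Ten hypothetical companies in one sector; all ten Web sites map to "C0".
ENTITIES = {f"C{i}": {"sector_code": "SECTOR-A"} for i in range(10)}
MAPPED_IDS = ["C0"] * 10

print(penetration_report(ENTITIES, MAPPED_IDS, "SECTOR-A"))
```

Counting distinct entities rather than Web sites is what separates this figure from the sector-wide average: the sample has ten sites for ten companies, yet only one company in ten is a Web entity.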

Other reports, such as time based reports, can also be generated by the 
Web analysis application. For example, the percentage of sector entities that are 
Web entities can be tracked over time to demonstrate the growth in the number or 
percentage of entities that are registered as controlling one or more Web sites 
("online penetration"). By limiting the entities in the report by entity size (e.g.,
number of employees) or other attribute (e.g., obtained from the entity
information database), the report can demonstrate other aspects of Web activity, 
such as the difference in online penetration among large, medium, and small 
companies, or which industrial sectors have the most or least online penetration. 
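
A breakdown of online penetration by entity size might look like the sketch below. The employee-count bands, the field name `employees`, and the sample entities are hypothetical; the text above does not define what counts as large, medium, or small.

```python
def penetration_by_size(entity_info, mapped_entity_ids, bands):
    """For each size band, return (Web entities, all entities) as counts."""
    mapped = set(mapped_entity_ids)
    result = {}
    for label, (low, high) in bands.items():
        members = [eid for eid, info in entity_info.items()
                   if low <= info["employees"] <= high]
        web = [eid for eid in members if eid in mapped]
        result[label] = (len(web), len(members))
    return result


# Hypothetical size bands, inclusive on both ends.
BANDS = {"small": (1, 9), "medium": (10, 999), "large": (1000, 10**9)}
SIZED_ENTITIES = {
    "A": {"employees": 5},
    "B": {"employees": 50},
    "C": {"employees": 2000},
    "D": {"employees": 8},
}

print(penetration_by_size(SIZED_ENTITIES, ["B", "C"], BANDS))
```

Running the same function on snapshots of the databases taken at different dates would give the time-based comparison described above.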

The mapping database and applications based on the mapping database 
may take advantage of a hierarchical organization of Web pages, by treating 
similarly a mapped page and all pages below the mapped page, such as pages 
sharing a particular prefix with the mapped page. For example, all pages sharing 
the prefix "http://www.isp321.com" may be treated as being under the control of
an ISP named Global ISP Co.
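
Prefix-based treatment of pages can be sketched as a longest-prefix lookup. Choosing the longest matching prefix is an assumption made here so that a customer's pages hosted under an ISP's domain resolve to the customer rather than the ISP; the function name and the table contents are hypothetical.

```python
def entity_for_url(url, prefix_to_entity):
    """Return the entity mapped to the longest matching prefix of `url`, or None."""
    best = None
    for prefix, entity in prefix_to_entity.items():
        if url.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, entity)
    return best[1] if best else None


# Hypothetical mapped prefixes: an ISP's site and one customer's home page.
PREFIXES = {
    "http://www.isp321.com": "Global ISP Co.",
    "http://www.isp321.com/~maplestplumb": "Maple Street Plumbing",
}

print(entity_for_url("http://www.isp321.com/~maplestplumb/prices.html", PREFIXES))
print(entity_for_url("http://www.isp321.com/frontpage.html", PREFIXES))
```

A plain string-prefix test is the simplest form of this idea; a fuller version would compare the host name and path components separately so that an unrelated host that merely begins with the same characters is not matched.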

The mapping database may map an entity to Web pages maintained at 
different Web sites. For example, Maple Street Plumbing may have a first set of 
Web pages at the Global ISP Co. site and a second set of Web pages at another 
ISP's site. 

The entity information database may include a database such as EDGAR 
that includes information about companies. 

Information in the mapping database or the entity information database 
may allow searches to be limited by relative size of entities, such as size in an 
industry. 
One or more of the databases referenced above may be or include a
relational database and may have records to which fields may be added readily. 

Any of many different types of computer equipment may be used. For 
example, one or more Intel-based personal computers may be used that run an
SQL database on Linux and that run programs written in Perl or the C programming
language with interfaces to the SQL database.

The technique (i.e., the procedures described above) may be implemented 
in hardware or software, or a combination of both. In at least some cases, it is 
advantageous if the technique is implemented in computer programs executing on 
one or more programmable computers, such as a personal computer running or
able to run an operating system such as Unix, Linux, Microsoft Windows 95, 98,
or NT, or Macintosh OS, that each include a processor, a storage medium readable
by the processor (including volatile and non-volatile memory and/or storage
elements), at least one input device such as a keyboard, and at least one output
device. Program code is applied to data entered using the input device to
perform the technique described above and to generate output information. The
output information is applied to one or more output devices such as a display
screen of the computer.

In at least some cases, it is advantageous if each program is implemented in 
a high level procedural or object-oriented programming language such as Perl, C,
C++, or Java to communicate with a computer system. However, the programs
can be implemented in assembly or machine language, if desired. In any case, the 
language may be a compiled or interpreted language. 

In at least some cases, it is advantageous if each such computer program is 
stored on a storage medium or device, such as ROM or optical or magnetic disc, 
that is readable by a general or special purpose programmable computer for 
configuring and operating the computer when the storage medium or device is 
read by the computer to perform the procedures described in this document. The 
system may also be considered to be implemented as a computer-readable storage 
medium, configured with a computer program, where the storage medium so 
configured causes a computer to operate in a specific and predefined manner. 

Other embodiments are within the scope of the following claims. For 
example, the user may be a human being or a non-human entity such as a 
computer program or an automated device that may interact with one or more of 
the databases or one or more of the applications via an application programming 
interface ("API") or a network message. An on-line information store or multiple 
databases may serve as the entity information database, which may take the form 
of any mechanism that provides automated access to information, such as a 
spreadsheet file or a store of email messages. System 110 may also refer to the 
mapping and entity information databases before reporting the live domain names 
to the search engine. For example, by referring to the mapping and entity
information databases, system 110 can retrieve entity information relating to the
live domain names, and can sort the live domain names according to the entity
information, such as by listing first the live domain names that pertain to an
industry that is indicated as being particularly relevant to the search engine or
users of the search engine. 
What is claimed is:

Claims 

1. A method comprising: 

acquiring a domain name from a domain name registry; 

determining whether the domain name represents any functioning Web
site; and

if the domain name is associated with at least one functioning Web site, 
recording that the domain name represents at least one functioning Web site. 



2. The method of claim 1, further comprising: 

if it is indicated that the domain name represents at least one functioning 
Web site, submitting the domain name to a search engine. 



3. The method of claim 1, further comprising: 

identifying an entity that is registered as having control over the domain
name;

determining whether the entity meets a set of prioritization criteria; and 
submitting the domain name to a search engine only if the entity meets the 
set of prioritization criteria.

4. A method comprising:

acquiring a set of criteria;

identifying a set of entities that meet the criteria; and

determining how many entities in the set of entities are registered as
having control over at least one Web site.



5. The method of claim 4, comprising: 

deriving a statistic from a result of the determination. 



6. Computer software, residing on a computer-readable storage medium,
comprising a set of instructions for use in a computer system to cause the
computer system to:

acquire a domain name from a domain name registry;

determine whether the domain name represents any functioning Web site;
and

if the domain name is associated with at least one functioning Web site,
record that the domain name represents at least one functioning Web site.

7. Computer software, residing on a computer-readable storage medium,
comprising a set of instructions for use in a computer system to cause the 
computer system to: 

acquire a set of criteria; 

identify a set of entities that meet the criteria; and 

determine how many entities in the set of entities are registered as having 
control over at least one Web site. 

8. A system comprising: 

an acquirer that acquires a domain name from a domain name registry; 

a determiner that determines whether the domain name represents any 
functioning Web site; and 

a recorder that, if the domain name is associated with at least one 
functioning Web site, records that the domain name represents at least one 
functioning Web site. 

9. A system comprising: 

an acquirer that acquires a set of criteria; 

an identifier that identifies a set of entities that meet the criteria; and 
a determiner that determines how many entities in the set of entities are
registered as having control over at least one Web site. 